Current theories of heteropolymers are inherently macroscopic, but are applied to folding proteins which are only
mesoscopic. In these theories, one computes the averaged free energy over sequences, always assuming that it is 
self-averaging - a property well-established only if a system with quenched disorder is macroscopic. By enumerating 
the states and energies of compact 18, 27, and 36mers on a simplified lattice model with an ensemble of random 
sequences, we test the validity of the self-averaging approximation. We find that fluctuations in the free energy
between sequences are weak, and that self-averaging is a valid approximation at the length scale of real proteins. 
These results validate certain sequence design methods which can exponentially speed up computational design 
and greatly simplify experimental realizations. 

Protein folding remains one of the most challenging
problems in polymer physics [...]. The phenomenon is
straightforward - at low temperature a heteropolymer 
chain freezes into a single configuration. However, the 
relationship between a chain's monomer sequence and 
the thermodynamics of its transition is complex. As a 
result, current theories of heteropolymer freezing resort 
to certain assumptions which have not been adequately 
tested, one of the most basic being self-averaging of the 
free energy. 

Self-averaging is a property of many disordered systems,
stating that the free energy of a system of size N
with quenched disorder is independent of the particular
realization of the disorder, to within variations of order
$\sqrt{N}$, which are relatively negligible as $N \to \infty$. This
property can be rigorously proved for a broad range of
models in macroscopic disordered systems [...]. Fundamentally,
it stems from the independence of sub-regions
in the $N \to \infty$ thermodynamic limit. In the context of
heteropolymers, it states that a random heteropolymer's
free energy is independent of its sequence, i.e.



$$F(\mathrm{seq}, T) \simeq \langle F(\mathrm{seq}, T) \rangle_{\mathrm{seq}} \qquad (1)$$

where $\langle \cdots \rangle_{\mathrm{seq}}$ indicates an average over sequences. There
have been some proofs of self-averaging for certain
heteropolymer models in the $N \to \infty$ limit [...]. Proteins,
however, are mesoscopic objects, and it is unclear
whether self-averaging applies at the chain lengths N of not
more than several hundred monomers found in proteins.

Self-averaging in heteropolymers is important for two
main reasons. First, it is relevant to the theoretical
understanding of protein folding. Starting from Refs. [...], key
modern theories of heteropolymers, reviewed recently in
Ref. [...], compute the averaged free energy of the system,
implicitly neglecting sequence-dependent variations in the
manner of Eq. (1). In particular, self-averaging is an
element of the replica method, and is used in the derivation
of the Random Energy Model for heteropolymers [...].
Second, self-averaging is an important assumption
of certain sequence design methods such as "imprinting" [...]
and "sequence selection" [...], ideas from which
have been used for de novo protein and ligand design
[...]. These methods have proven useful both experimentally
[...] and computationally. Unfortunately, computational
design methods that do not assume self-averaging
[...] require vastly more calculation time. To design a
heteropolymer sequence to fold into a certain conformation
* at temperature T, one should minimize the
quantity $E(\mathrm{seq}, *) - F(\mathrm{seq}, T)$ over all sequences. If
self-averaging is not assumed, one must calculate the energy
of all conformations for each sequence tested to determine
$F(\mathrm{seq}, T)$. However, if self-averaging is valid, then
the F term can be ignored and design can be carried out
by evaluating the energy of each sequence in just the one
conformation *. This exponentially speeds up the design
procedure.
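
The difference between the two design criteria can be made concrete with a small sketch. It assumes units with k_B = 1 and that each candidate sequence comes with a complete list of its conformation energies. The function names, the toy energy values, and the choice of target index are all hypothetical and serve only to illustrate the bookkeeping, not any particular published design code.

```python
import math


def free_energy(energies, T):
    """F = -T ln sum_c exp(-E_c / T) with k_B = 1.

    The minimum energy is factored out before exponentiating so that
    the sum stays numerically stable at low temperature.
    """
    e_min = min(energies)
    z_shifted = sum(math.exp(-(e - e_min) / T) for e in energies)
    return e_min - T * math.log(z_shifted)


def design_scores(spectra, target, T):
    """Return (full, fast) design scores for every candidate sequence.

    spectra[s][c] is the energy of sequence s in conformation c.
    full[s] = E(seq, *) - F(seq, T)   (needs the whole spectrum)
    fast[s] = E(seq, *)               (needs only the target conformation)
    """
    full, fast = [], []
    for energies in spectra:
        e_target = energies[target]
        fast.append(e_target)
        full.append(e_target - free_energy(energies, T))
    return full, fast


# Hypothetical toy spectra: 3 sequences, 4 conformations each.
spectra = [
    [-4.0, -1.0, 0.0, 1.0],
    [-2.0, -2.0, 0.0, 2.0],
    [0.0, -3.0, 1.0, 1.0],
]
full, fast = design_scores(spectra, target=0, T=1.0)
best_full = min(range(len(spectra)), key=full.__getitem__)
best_fast = min(range(len(spectra)), key=fast.__getitem__)
print(best_full, best_fast)
```

The full score requires one pass over all conformations per sequence, while the fast score is a single lookup. The two criteria pick the same sequence whenever the spread of F across candidates is small compared with the spread of E in the target conformation. In this toy example they happen to agree, which is an illustration only and not evidence for self-averaging.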

In vitro experiments have not yet provided suffi- 
cient evidence to verify self-averaging in random peptide 
chains. In the experiments that have studied random 
amino acid sequences, there have not been any obvious 
trends in the behavior due to the difficulty of such 
experiments and consequent lack of data. However, us- 
ing a computer simulation, we are able to sample many 
more sequences than can be analyzed feasibly in vitro, 
and thus determine whether the property of free energy 
self-averaging over sequences is valid for heteropolymers. 
We perform a scaling comparison of the exact free en- 
ergy and other parameters for several three-dimensional 
lattice heteropolymers of different size. We then extrap- 
olate our data to determine the validity of free energy 
self-averaging for protein-sized polymers. 

In order to study the thermodynamics of random het- 
eropolymers, we perform an exact lattice enumeration of 
the states of compact polymer chains of several different 
lengths. For each length, we examine many random sequences
made up of two monomer species, α and β. All
included sequences have the same (or, for odd N, nearly the same) number of monomers
of type α and type β, so as to remove any concentration
dependence. For each sequence, we evaluate its energy
in all possible compact conformations. Using this information,
we then calculate the free energy F(T) of each
sequence. To determine whether self-averaging is valid, 
we first compare the average and the standard deviation 
of the free energy over the examined sequences, and then 
examine the dependence of these quantities on the chain 
length N. Our method of enumeration is in contrast with 
other works which use Monte Carlo sampling of states to 
determine averaged thermodynamic properties [...]. By
doing a full enumeration, we are able to separate thermodynamic
properties of the system from kinetic effects. In
practice, the method is similar to the procedure used in
studies of designability [14], although we only examine
a finite sample of sequences, rather than testing every 
possible one. In our case, the focus is the complete free 
energy versus temperature curve, rather than just the 
ground state conformation for each sequence. 
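
In formulas, the quantities compared in this procedure can be written as follows. This is a sketch in units where k_B = 1, an assumption consistent with temperatures being measured against the interaction scale. The symbols Z, E_c(seq), and M (the number of sampled sequences) are introduced here for illustration and do not appear in the original text.

```latex
\begin{align}
Z(\mathrm{seq}, T) &= \sum_{c \,\in\, \text{compact}} e^{-E_c(\mathrm{seq})/T}, \\
F(\mathrm{seq}, T) &= -T \ln Z(\mathrm{seq}, T), \\
\langle F(T) \rangle &= \frac{1}{M} \sum_{m=1}^{M} F(\mathrm{seq}_m, T), \\
\delta F(T)^2 &= \frac{1}{M} \sum_{m=1}^{M}
  \bigl[ F(\mathrm{seq}_m, T) - \langle F(T) \rangle \bigr]^2 .
\end{align}
```

Whether the variance is normalized by M or by M - 1 is not stated in the text. For samples of a hundred or more sequences the difference is small.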

We use a standard model in which monomers are
placed at lattice positions $r_i$, and subject to an energy

$$E = \sum_{i<j} B_{s_i s_j} \, \Delta(r_i - r_j) \qquad (2)$$

where i and j run over the monomers in the chain, and
$s_i$ indicates the species (α or β) of monomer
i for a particular sequence $\{s_i\}$. Contact interactions
are enforced by setting $\Delta(r_i - r_j) = 1$ if $r_i$ and $r_j$
are on neighboring lattice points, and 0 otherwise.
Interactions between neighbors along the chain are not
included, as their total only provides a reference point for
the other energies. The interactions between monomer
species are tabulated in a matrix B having mean interaction
$\bar{B} = 0$ and standard deviation $\delta B = 1$. These values
are weighted according to the fraction $p_k$ of monomers
of each species in the system, i.e. $\bar{B} = \sum_{kl} p_k B_{kl} p_l$
and $\delta B^2 = \sum_{kl} p_k (B_{kl} - \bar{B})^2 p_l$, with k and l taking on
the monomer species types α and β. With these constraints,
homopolymer effects are removed and the freezing
temperature of the system should be of the order of
$\delta B = 1$. We first focus on Ising-type interactions, in
which $B_{\alpha\alpha} = B_{\beta\beta} = 1$ and $B_{\alpha\beta} = B_{\beta\alpha} = -1$.
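
As a concrete illustration of the contact energy of Eq. (2) and of the composition-weighted moments of B, the following minimal sketch may help. The species labels "a" and "b" stand in for α and β, and the function names and the four-monomer test chain are hypothetical choices made for this example.

```python
from itertools import combinations

# Ising-type interaction matrix from the text; "a" = alpha, "b" = beta.
B = {("a", "a"): 1.0, ("b", "b"): 1.0, ("a", "b"): -1.0, ("b", "a"): -1.0}


def matrix_moments(B, p):
    """Composition-weighted mean and standard deviation of B.

    p maps each species to its fraction in the chain.
    """
    mean = sum(p[k] * B[k, l] * p[l] for k in p for l in p)
    var = sum(p[k] * (B[k, l] - mean) ** 2 * p[l] for k in p for l in p)
    return mean, var ** 0.5


def contact_energy(sequence, positions, B):
    """Sum B[s_i, s_j] over pairs on neighboring lattice sites.

    Pairs that are adjacent along the chain (j == i + 1) are skipped,
    since the text excludes them from the energy.
    """
    energy = 0.0
    for i, j in combinations(range(len(sequence)), 2):
        if j == i + 1:
            continue
        dist = sum(abs(a - b) for a, b in zip(positions[i], positions[j]))
        if dist == 1:  # nearest neighbors on the cubic lattice
            energy += B[sequence[i], sequence[j]]
    return energy


# Hypothetical 4-monomer chain folded around a unit square.
square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
print(contact_energy("abba", square, B))   # one a-a contact between the chain ends
print(matrix_moments(B, {"a": 0.5, "b": 0.5}))
```

For an equal composition the weighted mean vanishes and the standard deviation is one, as required. For the 14 : 13 composition used for 27mers the same formula gives a weighted mean of (1/27)^2, so the zero-mean condition holds only approximately in that case.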

The restriction to compact conformations is partly dic- 
tated by computational constraints, and allows us to fully 
enumerate much larger values of N than would be pos- 
sible otherwise. This choice is also physically justified 
since, according to the molten globule model of freezing
[...], the available states of proteins at the freezing
transition are mostly compact. Furthermore, all such compact
configurations have the same number of contacts,
and therefore energy differences between configurations 
are only due to heteropolymeric contributions. We have 
selected a compact state with interactions switched off 
(B = 0) as our reference zero energy state (i.e. a
non-interacting compact homopolymer). With this choice,
the fluctuation in free energy over sequences is the het- 
eropolymeric quantity important to sequence design. 



We enumerated chains of length 18 (3 × 3 × 2), 27
(3 × 3 × 3), and 36 (3 × 3 × 4). The ratios α : β in
these chains were 9 : 9, 14 : 13, and 18 : 18, respectively.
We restricted our study to a set of 500 sequences each
for 18mers and 27mers, and 120 sequences for 36mers
for reasons of computational tractability. The enumeration
algorithm followed the procedure of Pande et al.
Computations were carried out on two Pentium II computers
and on a cluster at the University of Minnesota
Supercomputing Institute over a period of a few months.
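
To make the notion of enumerating compact conformations concrete, here is a brute-force sketch that lists every self-avoiding walk filling a rectangular box. It is not the algorithm of Pande et al.: it uses plain depth-first search, applies no symmetry reduction, and counts each chain shape once per direction of traversal. The function name is hypothetical.

```python
def enumerate_compact_walks(dims):
    """Yield every self-avoiding walk that visits all sites of a box.

    dims = (nx, ny, nz) is the box shape. Each walk is returned as a
    tuple of lattice sites in chain order. Walks related by lattice
    symmetry or by chain reversal are all listed separately.
    """
    nx, ny, nz = dims
    n_sites = nx * ny * nz
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
             (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    visited = set()
    path = []

    def extend(site):
        # Place the next monomer, recurse, then undo the placement.
        visited.add(site)
        path.append(site)
        if len(path) == n_sites:
            yield tuple(path)
        else:
            x, y, z = site
            for dx, dy, dz in steps:
                nxt = (x + dx, y + dy, z + dz)
                inside = (0 <= nxt[0] < nx and 0 <= nxt[1] < ny
                          and 0 <= nxt[2] < nz)
                if inside and nxt not in visited:
                    yield from extend(nxt)
        path.pop()
        visited.remove(site)

    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                yield from extend((x, y, z))


walks = list(enumerate_compact_walks((2, 2, 2)))
print(len(walks))
```

For the 2 × 2 × 2 box this brute-force count is 144 directed walks. The boxes used in the study are far larger, and a practical enumeration would remove symmetry-related copies and prune dead ends early.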

FIG. 1. Sequence dependent free energies versus temperature
for three different lengths N. The symbols and error bars
indicate the averaged free energy and its standard deviation.
The solid lines are the free energy curves for a few sample 
sequences. Variations in the free energy are small compared 
with its absolute value at all temperatures. 

Given the limitations of a lattice simulation, we cannot 
address questions that depend on the microscopic details 
of real proteins, and instead focus on general trends
which should be robust across different polymer models.
The basic test of self-averaging is whether sequence-dependent
variations in thermodynamic quantities are
significant. Let us first review the general features of the
free energy and its sequence-dependent fluctuations. For
any sequence, the free energy F = E − TS is expected to
be linear in temperature at both high and low temperatures.
At high temperatures, all states are accessible,
and the free energy is dominated by $T S_{\mathrm{all}}(N)$, where
$S_{\mathrm{all}}(N)$ is the logarithm of the number of compact
conformations of length N. Below its freezing temperature,
the free energy is controlled by the lowest energy states,
with a much smaller (possibly zero) slope of temperature
dependence given by the degeneracy of these states.
At these low temperatures, the entropy component of
the free energy is expected to depend strongly on the
sequence, though this contribution is small compared to
the energy component, which should be proportional to
N, and equal to within $\sqrt{N}$ for all sequences [...]. Fig. (1)
shows a few sample sequences that illustrate this behavior.
The error bars indicate the standard deviation of the
free energy $\delta F$ at each temperature, calculated over the
ensemble of sequences. One can see that the behavior is
as expected: at high temperature the curves are parallel
and linear in T; at low temperature the sequence dependence
is more important, in particular below the freezing
temperature, where the slopes of the curves change
around T ≈ 1.
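
The two linear regimes can be read off from the canonical free energy $F = -T \ln Z$ with $Z = \sum_c e^{-E_c/T}$. The following short worked form assumes k_B = 1. The symbols $\Omega$ (number of compact conformations), $\bar{E}$ (energy averaged uniformly over conformations), $E_0$ (ground state energy), and $g_0$ (ground state degeneracy) are introduced here for illustration.

```latex
\begin{align}
T \to \infty:&\quad Z \simeq \Omega \left( 1 - \frac{\bar{E}}{T} \right)
  \;\Rightarrow\; F \simeq \bar{E} - T\, S_{\mathrm{all}}(N),
  \qquad S_{\mathrm{all}}(N) = \ln \Omega, \\
T \to 0:&\quad Z \simeq g_0\, e^{-E_0/T}
  \;\Rightarrow\; F \simeq E_0 - T \ln g_0 .
\end{align}
```

The high temperature slope $-S_{\mathrm{all}}(N)$ is the same for every sequence, which is why the curves are parallel there. The low temperature slope $-\ln g_0$ vanishes for a non-degenerate ground state, and both $E_0$ and $g_0$ depend on the sequence.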

More importantly, Fig. (1) shows that the variations in
the free energy across sequences are significantly less than 
the absolute value of the free energy. In other words, the 
sequence dependent fluctuations are weak and Eq. (1) 
is a good approximation. At higher temperatures, the 
relative fluctuations become even less significant, because 
of the greater importance of the TS contribution. 

FIG. 2. The quench-averaged square of the free energy divided
by the free energy variance, as a function of polymer
length, at 3 different temperatures. The large values of this
quantity, as well as its increasing trend with N, imply that
the free energy variations between sequences will be insignifi- 
cant for polymers of hundreds of monomers, the length scale 
of proteins. 

In Fig. (2) we test self-averaging trends by considering
the size dependence of the relative variations in the free
energy. At this stage we should clarify what we mean
by self-averaging, i.e. the conditions that justify the
heteropolymer theories, as well as the fast methods of
sequence design. Heteropolymer theories rely on Eq. (1),
i.e. that fluctuations are small. More precisely, the standard
deviation among sequences $\delta F(T)$ must be much
less than the average value $\langle F(T) \rangle$. Design algorithms,
on the other hand, are used to find sequences for which
E and F are of the same magnitude. Therefore, the fast
design method of minimizing just E is a sufficient procedure
so long as sequence-dependent fluctuations of F are
much smaller than F itself, and hence much smaller than
the sequence dependencies selected into E. Thus the fast
design methods will be justified under the same condition
of $\langle F \rangle / \delta F \gg 1$. Another trend that we can look for is
whether $\langle F \rangle / \delta F$ is increasing with N. If this is true,
then proteins, which have values of N about an order of
magnitude larger than what we test, should have even
better self-averaging than our lattice models. We indeed
expect such a trend as larger values of N should include
more independent subregions, although this notion is imprecise
and there should be finite size effects [...].
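
The criterion can be evaluated directly from a sample of per-sequence free energies at fixed chain length and temperature. The sketch below is a minimal illustration: the function name and the four sample values are hypothetical, the population standard deviation is an assumed convention, and the magnitude of the mean is used because the free energy is negative relative to the reference state.

```python
import statistics


def self_averaging_ratio(free_energies):
    """Return (|<F>| / dF, <F>^2 / dF^2) for one chain length and temperature.

    free_energies holds one free energy value per sampled sequence.
    """
    mean_f = statistics.fmean(free_energies)
    std_f = statistics.pstdev(free_energies)  # assumed: population convention
    if std_f == 0.0:
        raise ValueError("no sequence-to-sequence variation in the sample")
    ratio = abs(mean_f) / std_f
    return ratio, ratio ** 2


# Hypothetical free energies for four sequences at one temperature.
sample = [-10.0, -10.5, -9.5, -10.0]
ratio, ratio_sq = self_averaging_ratio(sample)
print(ratio, ratio_sq)
```

A ratio much larger than one indicates that replacing F(seq, T) by its sequence average is a good approximation. Repeating the calculation for several chain lengths shows whether the squared ratio grows with N.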

As Fig. (2) shows, the results strongly support self-averaging:
$\langle F \rangle / \delta F \gg 1$ for all the data points at all temperatures
and polymer lengths. Furthermore, $\langle F \rangle / \delta F$
is increasing in N, which shows that self-averaging
is even better justified for larger protein-sized polymers.
(The reason for plotting the square of $\langle F \rangle / \delta F$ has to
do with extensivity, as discussed below.)

Figure (2) is the main result of this paper. It shows
that even for chains as short as 18 monomers, self-



don suggested in [15], [13], with θ = 0, π/8, π/4, and
All of these matrices show similar trends, and in



lible monomers (e.g. proteins), so long as the chains are 



monomer interactions accurately describe the chain energies.
That is, the number of contacts in a conformation
should be at least of the order of the number of possible
monomer-monomer interactions. This would be true if,
as is commonly accepted, only a few different interactions
(e.g. hydrophobicity) are significant, though the actual
number has been the subject of some scrutiny [15].

A secondary issue related to self-averaging is extensivity.
Self-averaging is traditionally derived from the idea
that sub-regions of the system behave independently [...].
The free energy of the complete system is approximately
equal to the sum of the free energies for many small
sub-regions. Each subregion contains a random realization of
local disorder, and if the system is large, the sum of the
free energy over subregions will be independent of the
overall quenched disorder. A consequence of such independence
of subregions is that both the free energy and
its variance will be extensive, i.e. linear in N as $N \to \infty$.
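
The independence argument can be written out in a few lines. This is an idealized sketch: the chain is assumed to split into $n = N/m$ statistically independent subregions of $m$ monomers each, with subregion free energies $f_i$ drawn from a common distribution of mean $\bar{f}$ and variance $\sigma_f^2$. The symbols $m$, $n$, $f_i$, $\bar{f}$, and $\sigma_f$ are introduced here for illustration, and surface terms and chain connectivity are neglected.

```latex
\begin{align}
F &\simeq \sum_{i=1}^{n} f_i, \qquad n = N/m, \\
\langle F \rangle &= n\, \bar{f} \;\propto\; N, \\
\delta F^2 &= n\, \sigma_f^2 \;\propto\; N, \\
\frac{\langle F \rangle^2}{\delta F^2} &= n\, \frac{\bar{f}^2}{\sigma_f^2} \;\propto\; N,
\qquad \frac{\delta F}{|\langle F \rangle|} \;\propto\; N^{-1/2} .
\end{align}
```

Under these assumptions the squared ratio of mean to standard deviation grows linearly with chain length, while the relative fluctuation decays as the inverse square root of N.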

Figure (3) tests the extensivity of the free energy at
three different temperatures: below, close to, and above
the freezing transition. As a rough guide, we have included
linear fits that pass through the four points at
N = 0, 18, 27, and 36. Note that while the free energy
is zero at N = 0 (a polymer of length zero has no energy
or entropy), the asymptotic linear limit for large N
does not have to pass through this N = 0 point because
of subleading surface terms. In order that the different
temperatures may be better compared, the free energies
have been divided by temperature, and compared with
their infinite temperature value of $S_{\mathrm{all}}(N)$. The results
at T = 3 are practically indistinguishable from $S_{\mathrm{all}}(N)$,
and in fact the dependence on N shows similar trends
at all three temperatures. It was shown by Pande et al.
[...] that $S_{\mathrm{all}}(N)$ has a good linear form when N is
extended to lengths as short as N = 48. Because of this,
we expect extensivity to improve when N is marginally
larger than what we have tested here.

FIG. 3. The dependence of the free energy on N, below,
near, and above the freezing temperature. Each curve has 
been divided by the corresponding temperature, so as to com- 
pare with the infinite temperature limit provided by the (log- 
arithm of) the number of configurations. The deviations from 
linearity indicate the importance of finite size effects for pro- 
tein sized heteropolymers. 

Since for a globule made up of independently contributing
subunits we would expect the variance in
the free energy to scale as N as well, the quantity
$\langle F(T) \rangle^2 / \delta F(T)^2$ should be proportional to N. Figure (2)
suggests that this may be the case at least at low temperatures.
The available data, however, have an upward
curvature, and the values at N = 36 are systematically
higher than our attempted linear fits. It may well be
that, as in the case of the entropy calculation in Ref. [...], the
results for $\langle F(T) \rangle^2 / \delta F(T)^2$ become linear at marginally
higher values of N. Despite these deviations from the expected
asymptotic extensivity, the large magnitude of the
plotted values justifies self-averaging according to Eq. (1).
It is indeed the very deviations from the asymptotic behavior
at these smaller sizes that necessitated the current
study, as they indicate that protein sized objects are not
quite extensive in the thermodynamic sense.

The main conclusion of this work is that sequence-dependent
fluctuations in the free energy of random heteropolymers
are small, even at values of N as low as
N = 18. Qualitatively speaking, this means that all random
sequences have nearly the same free energy. There
are also indications that the fluctuations decrease in im- 
portance as N increases. These facts together imply that 
self-averaging will be a good approximation for protein- 
sized heteropolymers. Although there are deviations 
from thermodynamic extensivity at this length scale, 
the key property of self-averaging is verified. This lat- 
ter property is important to sequence design algorithms. 
Our results show that sequence design can be carried 
out without having to calculate the energy of each tested 
sequence in all conformations. Instead, one need only 
calculate the energy of each sequence in the desired con- 
formation. This shortcut vastly reduces the necessary 
computation time. 